Synchronize using MTLSharedEvents #633
Conversation
Your PR requires formatting changes to meet the project's style guidelines. Click here to view the suggested changes.

diff --git a/lib/mtl/events.jl b/lib/mtl/events.jl
index 374f4285..8d2345ed 100644
--- a/lib/mtl/events.jl
+++ b/lib/mtl/events.jl
@@ -29,9 +29,11 @@ function MTLSharedEvent(dev::MTLDevice)
return obj
end
-function waitUntilSignaledValue(ev::MTLSharedEvent, value, timeoutMS=typemax(UInt64))
- @objc [ev::id{MTLSharedEvent} waitUntilSignaledValue:value::UInt64
- timeoutMS:timeoutMS::UInt64]::Bool
+function waitUntilSignaledValue(ev::MTLSharedEvent, value, timeoutMS = typemax(UInt64))
+ return @objc [
+ ev::id{MTLSharedEvent} waitUntilSignaledValue:value::UInt64
+ timeoutMS:timeoutMS::UInt64
+ ]::Bool
end
## shared event handle
diff --git a/src/state.jl b/src/state.jl
index 0782db68..918731c4 100644
--- a/src/state.jl
+++ b/src/state.jl
@@ -61,7 +61,7 @@ end
Return the `MTLSharedEvent` used to synchronize a queue
"""
function queue_event(queue::MTLCommandQueue)
- get!(task_local_storage(), (:MTLSharedEvent, queue)) do
+ return get!(task_local_storage(), (:MTLSharedEvent, queue)) do
MTLSharedEvent(queue.device)
end::MTLSharedEvent
end
@@ -82,7 +82,7 @@ First-In-First-Out manner, this synchronizes the GPU.
cmdbuf = MTLCommandBuffer(queue)
MTL.encode_signal!(cmdbuf, ev, val)
commit!(cmdbuf)
- MTL.waitUntilSignaledValue(ev,val)
+ MTL.waitUntilSignaledValue(ev, val)
return
end
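As an aside on the wrapper being reformatted above: per its `@objc` signature, `waitUntilSignaledValue` returns a `Bool`, presumably indicating whether the event reached the requested value before `timeoutMS` (milliseconds) elapsed. A minimal usage sketch follows; `device()` as the current-device accessor is an assumption about the surrounding Metal.jl API, and everything else uses names from the diff:

```julia
using Metal
using Metal: MTL

dev = device()                       # assumed Metal.jl accessor for the default MTLDevice
ev = MTL.MTLSharedEvent(dev)         # new shared event; signaledValue starts at 0

# Nothing signals the event here, so this should block for ~5 seconds
# (timeoutMS is in milliseconds) and then return false. The default
# timeout is typemax(UInt64), i.e. effectively wait forever.
ok = MTL.waitUntilSignaledValue(ev, UInt64(1), 5_000)
ok || @warn "event was not signaled within the timeout"
```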
Metal Benchmarks
| Benchmark suite | Current: 4d18869 | Previous: 422c43f | Ratio |
|---|---|---|---|
| latency/precompile | 9811076333 ns | 9811445750 ns | 1.00 |
| latency/ttfp | 3976837999.5 ns | 3973553937 ns | 1.00 |
| latency/import | 1277901146 ns | 1273914917 ns | 1.00 |
| integration/metaldevrt | 820625 ns | 919250 ns | 0.89 |
| integration/byval/slices=1 | 1530333 ns | 1620625 ns | 0.94 |
| integration/byval/slices=3 | 8915625 ns | 9296666.5 ns | 0.96 |
| integration/byval/reference | 1523333 ns | 1601437.5 ns | 0.95 |
| integration/byval/slices=2 | 2546958 ns | 2670791 ns | 0.95 |
| kernel/indexing | 593166.5 ns | 668834 ns | 0.89 |
| kernel/indexing_checked | 578084 ns | 663708 ns | 0.87 |
| kernel/launch | 8000 ns | 9042 ns | 0.88 |
| array/construct | 6000 ns | 6125 ns | 0.98 |
| array/broadcast | 566750 ns | 635500 ns | 0.89 |
| array/random/randn/Float32 | 807937 ns | 815104 ns | 0.99 |
| array/random/randn!/Float32 | 608417 ns | 635375 ns | 0.96 |
| array/random/rand!/Int64 | 543917 ns | 572750 ns | 0.95 |
| array/random/rand!/Float32 | 576000 ns | 597958 ns | 0.96 |
| array/random/rand/Int64 | 777688 ns | 733583 ns | 1.06 |
| array/random/rand/Float32 | 621020.5 ns | 620604 ns | 1.00 |
| array/accumulate/Int64/1d | 1285687.5 ns | 1328542 ns | 0.97 |
| array/accumulate/Int64/dims=1 | 1813167 ns | 1915812.5 ns | 0.95 |
| array/accumulate/Int64/dims=2 | 2151708 ns | 2252917 ns | 0.96 |
| array/accumulate/Int64/dims=1L | 11634083.5 ns | 11629709 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 9686812 ns | 9677895.5 ns | 1.00 |
| array/accumulate/Float32/1d | 1121125 ns | 1230791 ns | 0.91 |
| array/accumulate/Float32/dims=1 | 1550417 ns | 1648666.5 ns | 0.94 |
| array/accumulate/Float32/dims=2 | 1865291 ns | 1970042 ns | 0.95 |
| array/accumulate/Float32/dims=1L | 9850937 ns | 9975938 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 7285729 ns | 7387104.5 ns | 0.99 |
| array/reductions/reduce/Int64/1d | 1432625 ns | 1582500 ns | 0.91 |
| array/reductions/reduce/Int64/dims=1 | 1045959 ns | 1150791.5 ns | 0.91 |
| array/reductions/reduce/Int64/dims=2 | 1170750 ns | 1302979 ns | 0.90 |
| array/reductions/reduce/Int64/dims=1L | 2074667 ns | 2153834 ns | 0.96 |
| array/reductions/reduce/Int64/dims=2L | 3416958 ns | 3552729 ns | 0.96 |
| array/reductions/reduce/Float32/1d | 988125 ns | 1047958 ns | 0.94 |
| array/reductions/reduce/Float32/dims=1 | 797437.5 ns | 875334 ns | 0.91 |
| array/reductions/reduce/Float32/dims=2 | 745167 ns | 801083 ns | 0.93 |
| array/reductions/reduce/Float32/dims=1L | 1733083 ns | 1800646 ns | 0.96 |
| array/reductions/reduce/Float32/dims=2L | 1754125 ns | 1880520.5 ns | 0.93 |
| array/reductions/mapreduce/Int64/1d | 1445437.5 ns | 1555709 ns | 0.93 |
| array/reductions/mapreduce/Int64/dims=1 | 1057979.5 ns | 1139791 ns | 0.93 |
| array/reductions/mapreduce/Int64/dims=2 | 1179167 ns | 1276833.5 ns | 0.92 |
| array/reductions/mapreduce/Int64/dims=1L | 2092312 ns | 2162875 ns | 0.97 |
| array/reductions/mapreduce/Int64/dims=2L | 3438542 ns | 3576375 ns | 0.96 |
| array/reductions/mapreduce/Float32/1d | 1003208 ns | 1077521 ns | 0.93 |
| array/reductions/mapreduce/Float32/dims=1 | 783083 ns | 869417 ns | 0.90 |
| array/reductions/mapreduce/Float32/dims=2 | 747333.5 ns | 796458 ns | 0.94 |
| array/reductions/mapreduce/Float32/dims=1L | 1731000 ns | 1791958 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=2L | 1767250 ns | 1892687.5 ns | 0.93 |
| array/private/copyto!/gpu_to_gpu | 629667 ns | 654917 ns | 0.96 |
| array/private/copyto!/cpu_to_gpu | 798334 ns | 822333 ns | 0.97 |
| array/private/copyto!/gpu_to_cpu | 777667 ns | 815083 ns | 0.95 |
| array/private/iteration/findall/int | 1641458 ns | 1662667 ns | 0.99 |
| array/private/iteration/findall/bool | 1443125 ns | 1504500 ns | 0.96 |
| array/private/iteration/findfirst/int | 1817375 ns | 1894188 ns | 0.96 |
| array/private/iteration/findfirst/bool | 1792250 ns | 1818166 ns | 0.99 |
| array/private/iteration/scalar | 3896583.5 ns | 4215375 ns | 0.92 |
| array/private/iteration/logical | 2543354 ns | 2643833 ns | 0.96 |
| array/private/iteration/findmin/1d | 1844125 ns | 1959625 ns | 0.94 |
| array/private/iteration/findmin/2d | 1411020.5 ns | 1525062.5 ns | 0.93 |
| array/private/copy | 550250 ns | 563000 ns | 0.98 |
| array/shared/copyto!/gpu_to_gpu | 85792 ns | 83167 ns | 1.03 |
| array/shared/copyto!/cpu_to_gpu | 86791 ns | 83500 ns | 1.04 |
| array/shared/copyto!/gpu_to_cpu | 82958 ns | 82792 ns | 1.00 |
| array/shared/iteration/findall/int | 1647291 ns | 1691958 ns | 0.97 |
| array/shared/iteration/findall/bool | 1467375 ns | 1524959 ns | 0.96 |
| array/shared/iteration/findfirst/int | 1368521 ns | 1420354.5 ns | 0.96 |
| array/shared/iteration/findfirst/bool | 1362166.5 ns | 1403958 ns | 0.97 |
| array/shared/iteration/scalar | 209375 ns | 158834 ns | 1.32 |
| array/shared/iteration/logical | 2354042 ns | 2536750 ns | 0.93 |
| array/shared/iteration/findmin/1d | 1400750 ns | 1418937.5 ns | 0.99 |
| array/shared/iteration/findmin/2d | 1423666 ns | 1545646 ns | 0.92 |
| array/shared/copy | 251708 ns | 246937.5 ns | 1.02 |
| array/permutedims/4d | 2434292 ns | 2531354 ns | 0.96 |
| array/permutedims/2d | 1169312.5 ns | 1248000 ns | 0.94 |
| array/permutedims/3d | 1723729.5 ns | 1810958 ns | 0.95 |
| metal/synchronization/stream | 19000 ns | 14875 ns | 1.28 |
| metal/synchronization/context | 20292 ns | 15584 ns | 1.30 |
This comment was automatically generated by workflow using github-action-benchmark.
LGTM.
Is there a Metal3->4 migration document anywhere indicating we should switch from MTLCommandBuffer::waitUntilCompleted to a shared event?
Nope. I couldn't find anything in Apple's documentation that mentions full command queue synchronization. It seems like they're more focused on more granular synchronization with the new API. The benchmarks seem to indicate that this is slower for Metal 3, so maybe we leave it as-is for Metal 3 and only transition to event-based synchronization for Metal 4 command queues. What made me believe that this is how they expect us to do it with Metal 4 is that

I'm closing this PR to avoid polluting the PR list, and I linked to this PR in #612 for when I (or someone else) get around to finishing Metal 4 support. We can always reopen if someone thinks we should use MTLSharedEvents.
Synchronization itself seems slower, but all other operations seem slightly faster. Or is this just benchmark noise? It's awfully consistent. And I could imagine the microbenchmark being slower while real applications benefit.
Good point. I'll rerun, and if the results are similar I'll merge.
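For reference, the Metal 3 / Metal 4 split discussed above could look roughly like the following. This is a hypothetical sketch, not code from the PR: `supports_metal4` and `wait_completed` are invented stand-ins for a feature check and for a wrapper around `MTLCommandBuffer`'s `waitUntilCompleted`; the remaining names come from the diff above.

```julia
# Hypothetical dispatch between the two synchronization strategies.
function synchronize(queue::MTLCommandQueue)
    if supports_metal4(queue.device)           # invented feature check
        # Event-based: an empty command buffer signals the shared event;
        # FIFO ordering means all earlier work has finished once it fires.
        ev = queue_event(queue)
        val = ev.signaledValue + 1             # assumes no concurrent signaler
        cmdbuf = MTLCommandBuffer(queue)
        MTL.encode_signal!(cmdbuf, ev, val)
        commit!(cmdbuf)
        MTL.waitUntilSignaledValue(ev, val)
    else
        # Metal 3: commit an empty buffer and block until it completes.
        cmdbuf = MTLCommandBuffer(queue)
        commit!(cmdbuf)
        wait_completed(cmdbuf)                 # invented wrapper name
    end
    return
end
```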
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## main #633 +/- ##
==========================================
- Coverage 80.97% 80.96% -0.01%
==========================================
Files 61 61
Lines 2712 2722 +10
==========================================
+ Hits 2196 2204 +8
- Misses 516 518 +2

☔ View full report in Codecov by Sentry.
This reverts commit 1942968.
Reusing the same shared event reduces the number of allocations.
Hopefully this slightly improves performance, but more importantly it is in line with how I believe queue synchronization will have to work with Metal 4.
Edit: performance is slightly worse.
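The allocation saving mentioned above comes from caching one `MTLSharedEvent` per (task, queue) pair via the `queue_event` helper in the diff. Reuse is safe as long as each round trip signals and waits on a strictly increasing value; a minimal sketch of that invariant, with the counter management assumed rather than taken from the PR:

```julia
# One cached event, many synchronizations: each iteration signals and
# waits on a fresh, larger value, so stale signals can't satisfy new waits.
function sync_n_times(queue::MTLCommandQueue, n::Integer)
    ev = queue_event(queue)          # cached MTLSharedEvent, allocated once per task
    base = ev.signaledValue          # assumes this task is the only signaler
    for i in 1:n
        cmdbuf = MTLCommandBuffer(queue)
        MTL.encode_signal!(cmdbuf, ev, base + i)   # GPU side: set the value
        commit!(cmdbuf)
        MTL.waitUntilSignaledValue(ev, base + i)   # CPU side: block until set
    end
    return
end
```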